Text File | 1996-07-05 | 6KB | 155 lines
================================================================================
PROGRAMMERS' NOTES                                                   80 columns
                                                                   tab=4 spaces
for:  anal, analall, bpdagg, CSequence, CPeakList, CTrace, CTraceFile,
      FileFormat, autoseq, xlate, RIncludes, RInlines, DNA
This document describes conventions and nomenclature for the sources of the
programs mentioned above. This package comprises the sources developed for the
manipulation and analysis of chromatogram data generated by automated sequencing
of DNA.
The project was undertaken to fulfill the requirements of a Master's degree in
Computer Science and a Bioengineering Certificate at Washington University in
St. Louis, Missouri, USA. The thesis advisor was David States.
Everything described herein has been released to the public domain. Bug
reports, bug fixes, comments, suggestions, extensions, and the like are
desired.
Contact Addresses:
    Reece Hart      reece@ibc.wustl.edu
    David States    states@ibc.wustl.edu
================================================================================
CONTENTS
--------
PROGRAM DESCRIPTIONS
SOURCE DESCRIPTIONS
CONVENTIONS
KNOWN PROBLEMS, QUIRKS, AND IMPROVEMENT SUGGESTIONS
PROGRAM DESCRIPTIONS
--------------------
analyze - Runs a complete analysis suite on a named abi file. See script header
for usage.
analall - Runs anal on all abi files found in the current directory. See script
header for usage.
bpdagg - aggregates bpd files for individual bases into a single list sorted
by base position.
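
The aggregation step described for bpdagg can be sketched as follows. This is
only an illustration: the bpd file layout is not described in these notes, so
the BaseRecord structure and the aggregateByPosition function are hypothetical
names, and std::vector stands in for whatever containers the real program uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical record; the real bpd layout is not given in these notes.
struct BaseRecord {
    char base;      // 'A', 'C', 'G', or 'T'
    int  position;  // base position
};

// Concatenate the per-base lists, then order the result by position.
// stable_sort keeps the input order of records that share a position.
std::vector<BaseRecord> aggregateByPosition(
        const std::vector<std::vector<BaseRecord> >& perBase) {
    std::vector<BaseRecord> all;
    for (std::size_t i = 0; i < perBase.size(); ++i)
        all.insert(all.end(), perBase[i].begin(), perBase[i].end());
    std::stable_sort(all.begin(), all.end(),
                     [](const BaseRecord& a, const BaseRecord& b) {
                         return a.position < b.position;
                     });
    return all;
}
```

Given one list per base, the result is a single list in which the records of
all bases are interleaved by position.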
autoseq - an interface to the above modules. It performs essentially no
calculations itself, but instead directs the classes to perform the actions
themselves.
xlate - converts ABI to SCF; the only advantage of this over makeSCF is that
the input format may be specified as ABI0, which is the /raw/ data obtained
from the sequencer. This will probably not be of general interest.
SOURCE DESCRIPTIONS
-------------------
All source was written in C++ using AT&T C++ 3.01. I have noted errors during
compilation with g++, but have not yet attempted to correct them.
CSequence - A simple doubly linked (bidirectional) linear list template. It
supports essentially any data type.
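
For readers unfamiliar with the idea, a bidirectional linked list template
might look roughly like the sketch below. It is not the CSequence interface:
the class name Sequence and all of its members are hypothetical, and it uses
modern C++ (nullptr, = delete) rather than the AT&T C++ 3.01 dialect of the
original sources.

```cpp
#include <cstddef>

// Hypothetical sketch; not the CSequence interface from the sources.
template <class T>
class Sequence {
public:
    struct Node {
        T     value;
        Node* prev;   // link toward the head
        Node* next;   // link toward the tail
    };

    Sequence() : head_(nullptr), tail_(nullptr), size_(0) {}
    ~Sequence() {
        while (head_) { Node* n = head_->next; delete head_; head_ = n; }
    }
    Sequence(const Sequence&) = delete;             // copying is left out
    Sequence& operator=(const Sequence&) = delete;  // of this sketch

    // Append a copy of v at the tail.
    void append(const T& v) {
        Node* n = new Node{v, tail_, nullptr};
        if (tail_) tail_->next = n; else head_ = n;
        tail_ = n;
        ++size_;
    }

    Node* first() const { return head_; }
    Node* last()  const { return tail_; }
    std::size_t size() const { return size_; }

private:
    Node*       head_;
    Node*       tail_;
    std::size_t size_;
};
```

Because every node carries both prev and next links, the list can be walked
from either end, which is what "bidirectional" refers to.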
CPeakList - Defines the PeakRec structure (class) and some simple methods.
CPeakList is built on a CSequence<PeakRec> and implements many methods for the
manipulation and analysis of a collection of peaks.
CTrace - A template class which stores a large sequence (array) of any
numerical type. It performs many statistical and analytical functions such as
derivatives (returned as a CTrace<double>, from which subsequent derivatives
may be obtained), peak picking, scaling, translating, and I/O.
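
The idea of a derivative that is itself a double-valued trace, and of simple
peak picking, can be sketched as below. These free functions are hypothetical
and are not the CTrace methods; the finite-difference scheme and the "strict
local maximum" peak criterion are assumptions, since the notes do not say how
CTrace computes either.

```cpp
#include <cstddef>
#include <vector>

// First derivative by finite differences. The result is always double,
// so it can be passed back in to obtain higher derivatives.
template <class T>
std::vector<double> derivative(const std::vector<T>& y) {
    std::vector<double> d(y.size(), 0.0);
    if (y.size() < 2) return d;
    const std::size_t last = y.size() - 1;
    d[0]    = double(y[1]) - double(y[0]);            // forward difference
    d[last] = double(y[last]) - double(y[last - 1]);  // backward difference
    for (std::size_t i = 1; i < last; ++i)            // central difference
        d[i] = (double(y[i + 1]) - double(y[i - 1])) / 2.0;
    return d;
}

// Indices of strict local maxima. Flat-topped peaks are missed by this
// simple criterion.
template <class T>
std::vector<std::size_t> pickPeaks(const std::vector<T>& y) {
    std::vector<std::size_t> peaks;
    for (std::size_t i = 1; i + 1 < y.size(); ++i)
        if (y[i] > y[i - 1] && y[i] > y[i + 1])
            peaks.push_back(i);
    return peaks;
}
```

Calling derivative on an integer trace yields doubles, and calling it again on
that result gives the second derivative.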
CTraceFile - Assembles a collection of CTraces and a number of other
data members that together represent any of several chromatogram formats from
automated DNA sequencing experiments. It currently supports reading and
writing Standard Chromatogram Format (SCF) files, and reading any of the
data sets within an Applied Biosystems, Inc. (ABI) file.
FileFormat - Simple routines for the determination and description of
chromatogram file formats.
RIncludes - a set of common definitions, typedefs, etc.
RInlines - a set of useful inline routines
DNA - some simple DNA definitions and types
CONVENTIONS
-----------
* I've tried to provide a consistent coding style and this style relies
heavily on tab = 4 spaces.
KNOWN PROBLEMS, QUIRKS, AND IMPROVEMENT SUGGESTIONS
---------------------------------------------------
* The baseline command in autoseq is ambiguous: It actually /translates/ the
data. There should be separate baseline and translation flags.
* Assimilation of the peaks may have a problem because it inherits the
peak records from the individual traces. That is, it may be the case that
two lists point to the same PeakRec. This requires some investigation.
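
The kind of sharing suspected above can be illustrated with a small sketch. It
assumes the lists hold pointers to peak records, which these notes do not
confirm; PeakRecord, assimilateShared, and assimilateCopied are hypothetical
names, not part of CPeakList.

```cpp
#include <cstddef>
#include <vector>

struct PeakRecord {    // hypothetical; not the PeakRec from CPeakList
    int    position;
    double height;
};

// Shallow assimilation: the merged list points at the trace's own records,
// so a change made through one list is visible through the other.
std::vector<PeakRecord*> assimilateShared(
        const std::vector<PeakRecord*>& trace) {
    return trace;
}

// Deep assimilation: the merged list holds independent copies.
std::vector<PeakRecord> assimilateCopied(
        const std::vector<PeakRecord*>& trace) {
    std::vector<PeakRecord> out;
    for (std::size_t i = 0; i < trace.size(); ++i)
        out.push_back(*trace[i]);
    return out;
}
```

If the assimilated list is built the first way, editing or deleting a record
through it also affects the individual trace's list; copying avoids that at
the cost of extra memory.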
* For a series of peaks pairwise separated by less than some minimum
separation, exactly one peak is chosen. For series which span a region in which
more than one real peak exists, some peaks will be discarded. Therefore,
there's a balance between parameters which result in abundant peaks (thus
resulting in a large series of peaks in close pairwise proximity) and minimum
separation criteria which prune peaks with 'reasonable' separation. For small
minSeparation arguments (i.e. <=5), this generally isn't a problem and only the
peaks which result from noise are tossed (as was originally desired).
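
The trade-off just described can be reproduced with a small sketch. It assumes
the peaks are sorted by position and that the tallest peak of each series is
the one kept; the notes say only that exactly one peak is chosen, not which.
Peak and pruneBySeparation are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

struct Peak {          // hypothetical; not the PeakRec from CPeakList
    int    position;
    double height;
};

// peaks must be sorted by position. A "series" is a run in which each peak
// lies closer than minSeparation to its predecessor. One peak survives per
// series; keeping the tallest is an assumption of this sketch.
std::vector<Peak> pruneBySeparation(const std::vector<Peak>& peaks,
                                    int minSeparation) {
    std::vector<Peak> kept;
    std::size_t i = 0;
    while (i < peaks.size()) {
        Peak best = peaks[i];
        std::size_t j = i + 1;
        // Extend the series while neighbours are closer than minSeparation.
        while (j < peaks.size() &&
               peaks[j].position - peaks[j - 1].position < minSeparation) {
            if (peaks[j].height > best.height) best = peaks[j];
            ++j;
        }
        kept.push_back(best);
        i = j;
    }
    return kept;
}
```

With peaks at positions 10, 13, 16, and 40, a minSeparation of 5 merges only
the first three into one series and keeps two peaks. A minSeparation of 30
chains all four into a single series and keeps one, which is how a long series
can swallow real peaks.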
* In several cases, I've not made a new class where I probably should have.
For instance (ahem), I use the same CPeakList for both the peaks of individual
traces and the assimilated list. However, the assimilated list really has no
need for the statistical methods (in fact, their application to this list would
be meaningless). Pruning peaks should not be a tracefile function; however,
this was necessary because the assimilated list needs to know about all 4 of
the individual traces' peak lists. Thus, these classes are really not as
modular as they could/should be.
* The ted source has sparse references to a 'bottom' variable. I've assumed
that the only effect this has is to invert the trace and edit the
reverse-complement of the sequence. I'm not aware of any other effects this has
on the trace data.
* Class hierarchy
The current hierarchy is quite simple and is described above. It has worked
well for prototyping this system and is functional even for non-prototyping
purposes. However, I believe that a more abstract interpretation of the types
is now appropriate (and fairly easily done with the sources provided).
    CSequence
        more complete list operators and iterators (sort, doforeach, etc.)
    CArray
        sampling data
        stat fx
        sorting
        histograms
        derivatives
        peak picking
    CTrace: CArray
    CTraceSet
        collection of CTraces
        orthogonalization
        group calls to CArray methods (i.e. CalcStats, PickPeaks)
        resolve peaks
    CTraceFile
        CTraceSet
        reading and writing tracefiles
Peaks could be stored in a CPeakList as is done currently, or in a
CSequence<>. Different peak recs should be used for trace vs. set peaks.
Copy constructors for each class.
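
One possible reading of the proposed hierarchy is sketched below as a C++
skeleton. The class names come from the outline above, but every member shown
(mean, add, count, means, traces) is hypothetical, std::vector stands in for
the package's own storage, and treating CTraceFile as containing a CTraceSet
(rather than deriving from it) is an assumption. File I/O, orthogonalization,
and peak handling are omitted.

```cpp
#include <cstddef>
#include <vector>

// Generic numeric array with statistics (only a mean is shown).
template <class T>
class CArray {
public:
    explicit CArray(const std::vector<T>& d) : data_(d) {}
    std::size_t size() const { return data_.size(); }
    double mean() const {
        if (data_.empty()) return 0.0;
        double sum = 0.0;
        for (std::size_t i = 0; i < data_.size(); ++i) sum += double(data_[i]);
        return sum / double(data_.size());
    }
protected:
    std::vector<T> data_;
};

// CTrace: CArray
template <class T>
class CTrace : public CArray<T> {
public:
    explicit CTrace(const std::vector<T>& d) : CArray<T>(d) {}
};

// Collection of CTraces with group calls to CArray methods.
template <class T>
class CTraceSet {
public:
    void add(const CTrace<T>& t) { traces_.push_back(t); }
    std::size_t count() const { return traces_.size(); }
    // Group call: forwards mean() to every trace in the set.
    std::vector<double> means() const {
        std::vector<double> m;
        for (std::size_t i = 0; i < traces_.size(); ++i)
            m.push_back(traces_[i].mean());
        return m;
    }
private:
    std::vector<CTrace<T> > traces_;
};

// A CTraceSet plus tracefile reading and writing (I/O not shown).
template <class T>
struct CTraceFile {
    CTraceSet<T> traces;
};
```

In this sketch the compiler-generated copy constructors are sufficient because
all storage lives in std::vector; classes that manage raw memory would need
explicit ones, as the last note above suggests.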